Abstract

We consider an example by Haviv (1996) of a constrained Markov decision process that, in some 
sense, violates Bellman's principle. We resolve this issue by showing how to preserve a form of 
Bellman's principle that accounts for a change of constraint at states that are reachable from the 
initial state. 
1. Introduction

The most celebrated result in Markov decision process (MDP) theory is Bellman's optimality
principle, which can be stated as follows. (We assume that the reader is already generally familiar
with MDPs.) Let \(X_t\) be the state at (discrete) time t and \(r(X_t, a)\) the reward received if action a is
taken at state \(X_t\) (the stagewise reward). Let \(V^*(x)\) be the optimal cumulative reward starting at state
x. Then, Bellman's principle states that for each time t,

\[ V^*(X_t) = \max_a \{ r(X_t, a) + E_{X_t,a}[V^*(X_{t+1})] \} \]

where \(X_{t+1}\) is the random next state with distribution depending on \(X_t\) and a. Moreover, replacing
max by argmax on the right-hand side gives the optimal action at \(X_t\) (i.e., it characterizes the
optimal policy). But Bellman's principle is more than just an equation: it embodies an idea that
has become almost axiomatic in Markov decision theory. This idea is that the
optimal policy solves the optimization problem not just at the initial state \(X_0 = x\) but also at all
states reachable from it.

In this paper, we consider MDPs with explicit constraints. Such constrained MDPs have been
studied for at least a couple of decades and continue to draw interest (see, e.g., [1]-[11]). We are
interested here in a particular paper by Haviv [6], who raises an issue that has not been addressed in
the literature. Basically, Haviv constructs an example of a constrained MDP in which the optimal
policy starting at the initial state x is no longer optimal at states other than x, not even at a state y
that is reachable from x. He laments that this means that Bellman's principle is violated.
We will explore Haviv's issue thoroughly. In particular, we will show that there is some
preservation of Bellman's principle, provided we account for the fact that some of the "slackness" in the
constraint is spent in going from x to a reachable y. So, if we consider the optimal policy \(\pi^*\)
starting at state x, the optimality of \(\pi^*\) at state y is with respect to a different problem, one where
the constraint is modified with the "residual slackness." In analyzing Haviv's problem, we will
present some known results, some new results, and some related examples along the way to help
us understand and resolve the problem. Our analysis highlights the important maxim that when
imposing constraints on a decision problem, the constraints should apply only to those things over
which we have control.

2. Haviv's Problem 

In [6], Haviv gives an example (reproduced in Fig. 1) in which he shows that, given an optimal
policy for an optimization problem starting at some state x, the policy is not optimal with respect to
the same problem starting at a reachable state y. The structure of the problem is a multichain MDP
with initial state x, which is transient. There are three recurrent subchains that could be reached
from x. There is no reward for being in chain 1, while the stagewise reward is $10 at every state in
chain 2 and $20 at every state in chain 3. The constraint is that the expected frequency of visits to
states in \(S = S_1 \cup S_2 \cup S_3\) must not exceed 0.125 (think of states in S as the "bad" states). While
in chain i (i = 1, 2, 3), the frequency of visits to \(S_i\) is as shown in Fig. 1 (e.g., 0.2 for \(S_1\)). There is
only one state in which an action decision must be made: In state y, we can choose either action a
or b.

A quick examination of Haviv's problem shows that there is only one feasible policy: At state
y, select action a. Selecting action b at state y would violate the constraint, because the resulting
Markov chain would visit states in S with frequency 0.5(0.2 + 0.1) = 0.15. However, if the starting
state were y, we would want to pick action b, because this leads to chain 3 where the stagewise
reward exceeds that of chain 2, and the frequency of visits to S in chain 3 (that is, to \(S_3\)) is 0.1, which
does not exceed the constraint of 0.125. As noted before, this leads to Haviv's lament: Bellman's
principle is violated, because the optimal policy starting at state x is no longer optimal starting at
state y, even though y is reachable from x. As Haviv points out in [6] and we will emphasize again
later, the issue is related to the multichain nature of the example: that there are transient states and
recurrent subchains that are not reachable from each other.
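
The feasibility arithmetic above can be checked mechanically. The following is a minimal sketch and not part of Haviv's paper: the function name `visit_frequency` and the dictionary layout are illustrative, and the frequency 0.05 for chain 2 is an assumed value (the text quotes only 0.2 for chain 1 and 0.1 for chain 3). Exact fractions are used so that comparisons at the boundary of the constraint are not affected by rounding.

```python
from fractions import Fraction

# Long-run frequency of visits to the "bad" set S inside each recurrent chain.
# Chains 1 and 3 use the values quoted in the text; chain 2 is an ASSUMED value.
FREQ = {1: Fraction(2, 10), 2: Fraction(5, 100), 3: Fraction(1, 10)}
BOUND = Fraction(125, 1000)       # constraint level 0.125
CHAIN_AFTER = {"a": 2, "b": 3}    # action taken at y -> chain entered

def visit_frequency(start, action):
    """Expected frequency of visits to S when `action` is used at state y."""
    at_y = FREQ[CHAIN_AFTER[action]]
    if start == "y":
        return at_y
    # From x the process enters chain 1 or moves to y, each with probability 1/2.
    return Fraction(1, 2) * FREQ[1] + Fraction(1, 2) * at_y

for start in ("x", "y"):
    for action in ("a", "b"):
        f = visit_frequency(start, action)
        status = "feasible" if f <= BOUND else "infeasible"
        print(start, action, float(f), status)
```

Under these assumptions, action b is infeasible when the problem starts at x but feasible when it starts at y, which is exactly the discrepancy Haviv points to.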

More specifically, Haviv's problem illustrates that as far as optimality of a policy is concerned, 
arriving at state y from x is different from starting at y. From this point of view, the issue raised 
by Haviv appears to be related to that of time consistency in risk-averse multistage stochastic
programming, identified in a recent paper by Shapiro [12]. The same issue is also discussed in the
economics literature on multistage decision problems arising in dynamic portfolios; see, e.g., [13],
[14]. This issue has been recognized for some time in the context of time-varying preferences [15],
[16] and game-theoretic formalisms of such changing tastes [17], [18].

In lamenting the violation of Bellman's principle, Haviv quotes Denardo [19] on the principle
of optimality: "An optimal policy has the property that whatever the initial node (state) and initial 
arc (decision) are, the remaining arcs (decisions) must constitute an optimal policy with regard to 
the node (state) resulting from the first transition." But, we ask, must the policy be optimal with 
respect to the same problem? Indeed, Denardo concedes that: "The term principle of optimality 
is, however, somewhat misleading; it suggests that this is a fundamental truth, not a consequence 
of more primitive things." 

We will show that for a constrained MDP, the optimal policy starting at one state is optimal 
with respect to a problem with a modified constraint at each reachable state. Basically, in going 
from state x to y, we have "spent" some of the constraint, so the "residual" constraint is reduced. 
We submit that this is not an unreasonable predicament, and still satisfies Denardo's version of 
Bellman's principle. Moreover, the articulation of Bellman's principle we derive here is a
consequence of basic optimality conditions (see Theorems 1 and 3), which we argue are instances of
"more primitive things" referred to by Denardo.

3. Bellman's Principle 

3.1. Notation 

We first provide a framework for analyzing MDPs with inequality constraints, of the kind
considered by Haviv [6]. We have to set this up more rigorously than the statement of Bellman's
equation in Section 1, because: (1) we wish to incorporate explicit inequality constraints;
(2) we consider the case of expected long-term average reward (where Bellman's equation looks
slightly different); and (3) we need sufficient generality for multichain problems. For this reason,
we need some formal notation:

• State space: \(\mathcal{X}\), assumed countable.

• State sequence: \(\{X_t\} = \{X_0, X_1, X_2, \ldots\}\).

• Stagewise reward: \(r(x, a) \in \mathbb{R}\).

• Stagewise constraint: \(c(x, a) \in \mathbb{R}^n\).

• If x is a state and a an action, we write \(E_{x,a}\) for the conditional expectation given (x, a). For
example, if \(L^* : \mathcal{X} \to \mathbb{R}\) is a given function, then in \(E_{X_0,a}[L^*(X_1)]\) the state \(X_1\) is distributed
according to the transition probability distribution given \((X_0, a)\), and \(E_{X_0,a}[L^*(X_1)]\) is the
conditional expectation of \(L^*(X_1)\) with respect to this distribution given \((X_0, a)\). If a policy
\(\pi\) is given, then instead of writing \(E_{X_0,\pi(X_0)}[L^*(X_1)]\), we simply write \(E^{\pi}_{X_0}[L^*(X_1)]\). Similarly,
given a policy \(\pi\) and an initial state \(X_0\), the distribution of the Markov process \(\{X_t\}\) is well
defined, and we write \(E^{\pi}_{X_0}\) for the conditional expectation with respect to this distribution.

• Similarly, for conditional probability given an initial state \(X_0\) and policy \(\pi\), we use the
notation \(P_{X_0,a}\) and \(P^{\pi}_{X_0}\). We use "\(P^{\pi}_{X_0}\)-a.s." to mean almost surely (with probability one) with
respect to the probability measure \(P^{\pi}_{X_0}\).

• For a vector \(x \in \mathbb{R}^n\), we write \(x \ge 0\) to mean nonnegativity of each component.

3.2. Optimal Policy 

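Fix a state \(x \in \mathcal{X}\) and set the initial state \(X_0 = x\). Let

\[ V_T^{\pi}(x) = \frac{1}{T} \sum_{t=0}^{T-1} r(X_t, \pi(X_t)) \]

and

\[ W_T^{\pi}(x) = \frac{1}{T} \sum_{t=0}^{T-1} c(X_t, \pi(X_t)). \]

The objective function is given by

\[ V^{\pi}(x) = E^{\pi}_{x}\Big[ \lim_{T \to \infty} V_T^{\pi}(x) \Big] \]

and the constraint function by

\[ W^{\pi}(x) = E^{\pi}_{x}\Big[ \lim_{T \to \infty} W_T^{\pi}(x) \Big]. \qquad (1) \]

With this notation, the optimization problem given \(X_0 = x\) is as follows:

\[
\begin{array}{ll}
\underset{\pi}{\text{maximize}} & V^{\pi}(x) \\
\text{subject to} & W^{\pi}(x) \ge 0.
\end{array} \qquad (2)
\]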

Note that this form of the problem is sufficiently general to cover other inequality constraints:
\(W^{\pi}(x) \le 0\), \(W^{\pi}(x) \le w(x)\), etc.

First, we give sufficient conditions under which a policy is optimal with respect to (2). Though
stated formally, we provide this result not to claim any novelty in it, but merely so that we can use
it as a rigorous platform on which to frame our analysis. Indeed, similar results can be found in
the book by Altman [7], though not exactly in this form (which is constructed explicitly for the
convenience of our current purposes). We also provide a proof, using only elementary and familiar
arguments, similar to those in the book by Ross [20].



Theorem 1. Fix \(x \in \mathcal{X}\) and set the initial state \(X_0 = x\). Suppose there exist a policy \(\pi^*\), a vector
\(\mu \in \mathbb{R}^n\), a constant \(V^*(x) \in \mathbb{R}\), and a bounded function \(L^* : \mathcal{X} \to \mathbb{R}\) such that the following hold:

(A1) \(W^{\pi^*}(x) \ge 0\)

(A2) \(\mu \ge 0\)

(A3) \(\mu^T W^{\pi^*}(x) = 0\)

(A4) \(V^*(x) + L^*(X_t) = \max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \}\) for t = 0, 1, ..., \(P^{\pi^*}_{x}\)-a.s.

(A5) \(\pi^*(X_t) \in \arg\max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \}\) for t = 0, 1, ..., \(P^{\pi^*}_{x}\)-a.s.

Then \(\pi^*\) is optimal with respect to (2) and \(V^{\pi^*}(x) = V^*(x)\).

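Proof. Let \(\pi\) be a feasible policy. (Note that \(\pi^*\) is feasible by assumption (A1).) Then by
assumption (A4), \(P^{\pi}_{x}\)-a.s. for t = 0, 1, ...,

\[
\begin{aligned}
V^*(x) + L^*(X_t) &= \max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \} \\
&\ge r(X_t, \pi(X_t)) + \mu^T c(X_t, \pi(X_t)) + E^{\pi}_{X_t}[L^*(X_{t+1})]
\end{aligned}
\]

with equality if \(\pi = \pi^*\) (by assumption (A5)). Now multiply throughout by 1/T and sum from
t = 0 to T - 1 to obtain

\[ V^*(x) + \frac{1}{T} \sum_{t=0}^{T-1} L^*(X_t) \ge \frac{1}{T} \sum_{t=0}^{T-1} \Big( r(X_t, \pi(X_t)) + \mu^T c(X_t, \pi(X_t)) + E^{\pi}_{X_t}[L^*(X_{t+1})] \Big), \]

which can be written as

\[ V^*(x) + \frac{1}{T} L^*(X_0) \ge V_T^{\pi}(x) + \mu^T W_T^{\pi}(x) + \frac{1}{T} \sum_{t=1}^{T-1} \Big( E^{\pi}_{X_{t-1}}[L^*(X_t)] - L^*(X_t) \Big) + \frac{1}{T} E^{\pi}_{X_{T-1}}[L^*(X_T)]. \]

Next, take limits as \(T \to \infty\), take expectation \(E^{\pi}_{x}\), use the boundedness assumption on \(L^*\)
(so that the terms \(\frac{1}{T} L^*(X_0)\) and \(\frac{1}{T} E^{\pi}_{X_{T-1}}[L^*(X_T)]\) vanish in the limit), and use
the fact that \(E^{\pi}_{x}[E^{\pi}_{X_{t-1}}[L^*(X_t)]] = E^{\pi}_{x}[L^*(X_t)]\) (for \(t \ge 1\)) to obtain

\[ V^*(x) \ge V^{\pi}(x) + \mu^T W^{\pi}(x) \]

with equality if \(\pi = \pi^*\). Because \(\pi\) is feasible, \(W^{\pi}(x) \ge 0\). Hence, because \(\mu \ge 0\) by assumption
(A2),

\[ V^*(x) \ge V^{\pi}(x). \]

Now, for \(\pi = \pi^*\), we use assumption (A3) to obtain

\[ V^*(x) = V^{\pi^*}(x), \]

and in particular \(V^{\pi^*}(x) \ge V^{\pi}(x)\). This completes the proof. □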
The equation

\[ V^*(X_0) + L^*(X_t) = \max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \} \]

resembles Bellman's equation for the unconstrained case. Indeed, we can think
of this optimality condition for the constrained case as Bellman's equation associated with an
unconstrained MDP with stagewise reward \(r(X_t, a) + \mu^T c(X_t, a)\) (called the Lagrangian reward).
However, the optimality conditions (A1-A5) include not only Bellman's equation, but also
conditions (equation and inequalities) akin to Karush-Kuhn-Tucker (KKT) conditions. This form of the
optimality condition benefits from the usual interpretation of the multiplier vector \(\mu\) as a "price"
vector, and suggests the possibility of approaching the problem using duality principles (though
we do not pursue this line of approach any further here).
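
The KKT-like conditions (A1)-(A3) and the Lagrangian reward are simple to evaluate numerically once the constraint value of a candidate policy is known. The sketch below is illustrative only: the function names, the tolerance, and the sample numbers are assumptions made here, not quantities taken from the text.

```python
def lagrangian_reward(r, c, mu):
    """Stagewise Lagrangian reward r + mu^T c for one state-action pair."""
    return r + sum(m * ci for m, ci in zip(mu, c))

def kkt_conditions_hold(w, mu, tol=1e-9):
    """Check (A1) w >= 0, (A2) mu >= 0, and (A3) mu^T w = 0, up to `tol`."""
    primal = all(wi >= -tol for wi in w)                        # (A1)
    dual = all(mi >= -tol for mi in mu)                         # (A2)
    slack = abs(sum(mi * wi for mi, wi in zip(mu, w))) <= tol   # (A3)
    return primal and dual and slack

# Hypothetical numbers: the first constraint is active, the second has slack.
print(kkt_conditions_hold(w=[0.0, 0.3], mu=[2.0, 0.0]))
# Same multipliers, but the first constraint is no longer active.
print(kkt_conditions_hold(w=[0.1, 0.3], mu=[2.0, 0.0]))
print(lagrangian_reward(r=10.0, c=[-0.5, 0.25], mu=[2.0, 0.0]))
```

Condition (A3) is the complementary slackness requirement: a multiplier component may be positive only where the corresponding constraint is active.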

3.3. Optimality at Subsequent Reachable States 

We say that a state y is reachable from x at time \(t \in \{0, 1, \ldots\}\) under policy \(\pi\) if, given \(X_0 = x\),
we have \(P^{\pi}_{x}\{X_t = y\} > 0\). We say that y is reachable from x under \(\pi\) if there exists \(t \in \{0, 1, \ldots\}\)
such that y is reachable from x at time t.

Next, we show that the sufficient conditions in Theorem 1 are enough for the same \(L^*\) to satisfy
Bellman's equation at every reachable state.

Theorem 2. Fix \(x \in \mathcal{X}\). Suppose there exist a policy \(\pi^*\), a vector \(\mu \in \mathbb{R}^n\), a constant \(V^*(x) \in \mathbb{R}\),
and a bounded function \(L^* : \mathcal{X} \to \mathbb{R}\) such that assumptions (A1-A5) hold. Then, for each state
\(y \in \mathcal{X}\) reachable from x under \(\pi^*\),

\[ V^*(x) + L^*(y) = \max_a \{ r(y, a) + \mu^T c(y, a) + E_{y,a}[L^*(X')] \} \]

\[ \pi^*(y) \in \arg\max_a \{ r(y, a) + \mu^T c(y, a) + E_{y,a}[L^*(X')] \}, \]

where \(X'\) is distributed according to the transition distribution given (y, a).

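Proof. Suppose there is some state \(y \in \mathcal{X}\) reachable from x under \(\pi^*\) such that
\(V^*(x) + L^*(y) \ne \max_a \{ r(y, a) + \mu^T c(y, a) + E_{y,a}[L^*(X')] \}\). Since y is reachable, there is some
\(t \in \{0, 1, 2, \ldots\}\) such that \(P^{\pi^*}_{x}\{X_t = y\} > 0\). Let A be the event that
\(V^*(x) + L^*(X_t) \ne \max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \}\).
Then, by assumption (A4), \(P^{\pi^*}_{x}(A) = 0\). However, we can also write

\[
\begin{aligned}
P^{\pi^*}_{x}(A) &= P^{\pi^*}_{x}(A \mid \{X_t = y\}) \, P^{\pi^*}_{x}\{X_t = y\} + P^{\pi^*}_{x}(A \mid \{X_t \ne y\}) \, P^{\pi^*}_{x}\{X_t \ne y\} \\
&\ge P^{\pi^*}_{x}(A \mid \{X_t = y\}) \, P^{\pi^*}_{x}\{X_t = y\} \\
&= P^{\pi^*}_{x}\{X_t = y\} \\
&> 0,
\end{aligned}
\]

which is a contradiction.

A similar argument yields \(\pi^*(y) \in \arg\max_a \{ r(y, a) + \mu^T c(y, a) + E_{y,a}[L^*(X')] \}\). This completes the
proof. □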
The theorem above immediately implies that conditions (A4) and (A5) (along sample paths)
hold for any state y reachable from x.

Corollary 1. Fix \(x \in \mathcal{X}\). Suppose there exist a policy \(\pi^*\), a vector \(\mu \in \mathbb{R}^n\), a constant \(V^*(x) \in \mathbb{R}\),
and a bounded function \(L^* : \mathcal{X} \to \mathbb{R}\) such that assumptions (A1-A5) hold. Let \(y \in \mathcal{X}\) be reachable
from x under \(\pi^*\), and suppose we set the initial state to be \(X_0 = y\). Then, \(P^{\pi^*}_{y}\)-a.s. for t = 0, 1, 2, ...,

\[ V^*(x) + L^*(X_t) = \max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \} \]

\[ \pi^*(X_t) \in \arg\max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \}. \]



The result above shows that Bellman's equation holds at all reachable states. Specifically,
(A4) and (A5) hold for any state reachable from x (with the objective function value \(V^*(x)\) and
multiplier vector \(\mu\)). But this is not enough to show that \(\pi^*\) is optimal at any state reachable from
x. The key hurdle is feasibility (i.e., (A1)). To be specific, suppose that state y is reachable from x
under \(\pi^*\). In general, it is not true that \(\pi^*\) is optimal with respect to the problem

\[
\begin{array}{ll}
\underset{\pi}{\text{maximize}} & V^{\pi}(y) \\
\text{subject to} & W^{\pi}(y) \ge 0.
\end{array}
\]

Indeed, it is easy to construct examples for which \(\pi^*\) is not feasible for the above problem (e.g.,
Haviv's example [6]). However, a modification to the constraint (which depends on x) gives us an
optimization problem starting at y for which \(\pi^*\) is indeed optimal, as stated below.

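First, we need some additional notation. Given \(X_0 = x\), let y be reachable from x at time t
under \(\pi^*\). Define

\[ C_y(x) = -E^{\pi^*}_{x}\big[ W^{\pi^*}(X_t) \mid X_t \ne y \big] \, \frac{P^{\pi^*}_{x}\{X_t \ne y\}}{P^{\pi^*}_{x}\{X_t = y\}}. \]

Note that if \(E^{\pi^*}_{x}[W^{\pi^*}(X_t) \mid X_t \ne y] > 0\), then \(C_y(x) < 0\). Moreover, the smaller the value of
\(P^{\pi^*}_{x}\{X_t = y\}\), the larger the value of \(|C_y(x)|\).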
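
To make the dependence of \(C_y(x)\) on the reaching probability concrete, here is a small sketch for a scalar constraint. The function name `residual_slackness` and its argument names are hypothetical; `w_elsewhere` stands for the conditional expectation of \(W^{\pi^*}(X_t)\) given \(X_t \ne y\), and `p_reach` for \(P^{\pi^*}_{x}\{X_t = y\}\).

```python
def residual_slackness(w_elsewhere, p_reach):
    """C_y(x) = -E[W(X_t) | X_t != y] * P{X_t != y} / P{X_t = y} (scalar case)."""
    if not 0.0 < p_reach <= 1.0:
        raise ValueError("y must be reachable: need 0 < p_reach <= 1")
    return -w_elsewhere * (1.0 - p_reach) / p_reach

# With a fixed positive constraint value elsewhere, C_y(x) is negative and
# grows in magnitude as y becomes less likely to be reached.
for p in (0.5, 0.1, 0.01):
    print(p, residual_slackness(0.1, p))
```

A negative conditional value elsewhere gives a positive \(C_y(x)\) instead, which corresponds to a tighter constraint at y.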

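Theorem 3. Fix \(x \in \mathcal{X}\). Suppose there exist a policy \(\pi^*\), a vector \(\mu \in \mathbb{R}^n\), a constant \(V^*(x) \in \mathbb{R}\),
and a bounded function \(L^* : \mathcal{X} \to \mathbb{R}\) such that assumptions (A1-A5) hold with \(X_0 = x\). Let y be
reachable from x at time t under \(\pi^*\). Then \(\pi^*\) is optimal with respect to the problem

\[
\begin{array}{ll}
\underset{\pi}{\text{maximize}} & V^{\pi}(y) \\
\text{subject to} & W^{\pi}(y) \ge C_y(x)
\end{array}
\]

and \(V^{\pi^*}(y) = V^*(x) - \mu^T C_y(x)\).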
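Proof. We have, given \(X_0 = x\),

\[
\begin{aligned}
W^{\pi^*}(x) &= E^{\pi^*}_{x}\Big[ \lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} c(X_k, \pi^*(X_k)) \Big] \\
&= E^{\pi^*}_{x}\Big[ \lim_{T \to \infty} \Big( \frac{1}{T} \sum_{k=0}^{t-1} c(X_k, \pi^*(X_k)) + \frac{T-t}{T} \cdot \frac{1}{T-t} \sum_{k=t}^{T-1} c(X_k, \pi^*(X_k)) \Big) \Big] \\
&= E^{\pi^*}_{x}\Big[ \lim_{T \to \infty} \frac{1}{T-t} \sum_{k=t}^{T-1} c(X_k, \pi^*(X_k)) \Big] \\
&= E^{\pi^*}_{x}\big[ W^{\pi^*}(X_t) \big].
\end{aligned}
\]

Conditioning on whether \(X_t = y\) and using the definition of \(C_y(x)\), the right-hand side equals
\(P^{\pi^*}_{x}\{X_t = y\} \, (W^{\pi^*}(y) - C_y(x))\). By assumption, \(W^{\pi^*}(x) \ge 0\), which implies that
\(W^{\pi^*}(y) - C_y(x) \ge 0\). Moreover, because \(\mu^T W^{\pi^*}(x) = 0\), we have \(\mu^T (W^{\pi^*}(y) - C_y(x)) = 0\).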

Now, set the initial condition \(X_0 = y\). Define a new stagewise constraint function
\(\tilde{c}(\cdot, a) = c(\cdot, a) - C_y(x)\) (subtracting the same constant for each a) and let
\(\tilde{W}^{\pi^*}(y) = W^{\pi^*}(y) - C_y(x)\), which is
the expected average constraint function defined accordingly using \(\tilde{c}\), analogous to (1). From the
above, we have \(\tilde{W}^{\pi^*}(y) \ge 0\) and \(\mu^T \tilde{W}^{\pi^*}(y) = 0\). By Corollary 1, \(P^{\pi^*}_{y}\)-a.s. for t = 0, 1, 2, ...,

\[ V^*(x) + L^*(X_t) = \max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \}. \]

Subtract \(\mu^T C_y(x)\) from both sides to obtain

\[ (V^*(x) - \mu^T C_y(x)) + L^*(X_t) = \max_a \{ r(X_t, a) + \mu^T \tilde{c}(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \}. \]

Finally, again by Corollary 1, \(P^{\pi^*}_{y}\)-a.s. for t = 0, 1, 2, ...,

\[ \pi^*(X_t) \in \arg\max_a \{ r(X_t, a) + \mu^T c(X_t, a) + E_{X_t,a}[L^*(X_{t+1})] \}. \]

Note that if we substitute \(\tilde{c}\) for c, the above still holds. We can now apply Theorem 1 to obtain the
desired result. □



The theorem above has the interpretation of Bellman's principle for constrained problems.
Recall that in the unconstrained case, this principle states that if \(\pi^*\) is an optimal policy for a problem
starting at some state x, then it is also optimal for a problem starting at any state y reachable from
x. The main wrinkle in the constrained case is that the constraint for the problem starting at y
is different from the one for the problem starting at x, because we have to take into account how the
constraint function depends on other states that can be reached. Basically, \(C_y(x)\) plays the role of a "residual
slackness" of the constraint at the reachable state y, after the sojourn from x to y.

Note that if \(C_y(x) > 0\), then the constraint is more stringent at y. In this case, we can interpret
\(C_y(x)\) as the constraint that is "spent" in going from x to y. On the other hand, if \(C_y(x) < 0\), then
we gain some slackness in going from x to y (i.e., the constraint is less stringent).

3.4. Haviv's Example 

We can use Theorem 3 to construct the optimization problem starting at y for which the given
policy is indeed optimal. We use the notation \(1_S(\cdot)\) for the indicator function of S, so that \(1_S(x) = 1\)
if \(x \in S\), and \(1_S(x) = 0\) otherwise. We have:

• \(c(\cdot, a) = 0.125 - 1_S(\cdot)\)

• \(C_y(x) = -(0.125 - 0.2)(0.5/0.5) = 0.075\)

• \(\tilde{c}(\cdot, a) = 0.125 - 1_S(\cdot) - 0.075 = 0.05 - 1_S(\cdot)\)

So, instead of needing the expected frequency of visits to S not to exceed 0.125, at state y the
constraint level becomes 0.05 (more stringent). In other words, we "spent" 0.075 of the constraint in
going from x to y, and the "residual" constraint starting at state y is that the frequency of visits to
S should not exceed 0.05. In this case, clearly only action a is feasible at y.

4. Satisfying Haviv 

4.1. Form of Constraint is Bad 

Would our modified form of Bellman's principle satisfy Haviv? We suspect not. Haviv's
point is that intuition dictates that the optimal policy should pick action b at state y, though he
acknowledges that such a policy would not be feasible with respect to the problem (2). He therefore
goes on to argue that the form of the constraint in (2) is problematic. The version of Bellman's
principle in Theorem 3 is not entirely satisfactory because, one could argue, the constraint should
not change depending on what happened in the past.

This is related to the issue of time consistency in [12]. Shapiro [12] defines time consistency
as "the requirement that at every state of the system our 'optimal' decisions should not depend on
scenarios which we already know cannot happen in the future." In Haviv's example, once we are
in state y, we know that we will not enter chain 1. Yet, it is the frequency of visits to states in \(S_1\)
within chain 1 that causes action b to be infeasible at state y.

This seems to be a legitimate concern. We further illustrate this concern below by applying 
our result to a different example. This example is in contrast to Haviv's, because it turns out that 
at reachable states that are unlikely to be visited, the constraint might be unreasonably relaxed. 
Example: Squander or save

Consider the problem in Fig. 2. Starting at state x, the process will go to either state y or
z depending on whether or not we win the lottery, respectively. The probability of winning the
lottery is (realistically) a small number \(\varepsilon\), as shown in the figure. If we do not win the lottery, we
have the choice of whether or not to buy a yacht. Depending on our choice, we will end up in one
of two possible subchains. In the unlikely event that we do win the lottery, we have the choice
of whether or not to squander all our money. Again, depending on this choice, we end up in one
of two possible subchains. Within each subchain, the stagewise reward at all states is fixed at the
value shown in the figure (e.g., 50 in chain 1). These reward values are meant to signify the level
of enjoyment of life within these subchains.

In this example, the constraint is that the expected frequency of visits to states in
\(S = S_1 \cup S_2 \cup S_3 \cup S_4\) should not exceed 0.3. This constraint reflects the desire that we limit the
probability that we will go broke (have no money) before retiring; the states in S are those that
represent being broke. The frequency of visits to the "bad" states in each of
the subchains is shown as \(P(S_i)\), i = 1, 2, 3, 4, in Fig. 2.

It is clear that because \(\varepsilon\) is taken to be very small, it is overwhelmingly likely that we will
enter state z, in which case we cannot afford to buy a yacht; doing so would send us into chain 3,
where the frequency of visiting "bad" states is 0.4, exceeding 0.3. But what about in state y, which
corresponds to winning the lottery?

In this problem, it turns out that \(C_y(x) = 0.1(1 - 1/\varepsilon) < 0\). So, depending on how small \(\varepsilon\) is,
\(C_y(x)\) can be made arbitrarily negative. Specifically, for \(\varepsilon < 1/11\), the optimal action at state y is
to squander. To be sure, it is not that we can spend more if we win the lottery, but that because it is
so unlikely that we win, once we win we can do whatever we like without violating the constraint.
This clearly illustrates that the form of the constraint is problematic, as Haviv points out.
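
The relaxation can be tabulated directly from the expression for \(C_y(x)\) quoted above. The sketch below is illustrative: the function names are hypothetical, and the stagewise constraint is assumed to have the form \(c = 0.3 - 1_S\) (by analogy with Haviv's example), so that the residual constraint \(W^{\pi}(y) \ge C_y(x)\) reads "frequency of visits to S at most \(0.3 - C_y(x)\)".

```python
from fractions import Fraction

LEVEL = Fraction(3, 10)   # constraint level 0.3 on the frequency of visits to S

def residual_constraint(eps):
    """C_y(x) = 0.1 (1 - 1/eps), the expression quoted in the text."""
    return Fraction(1, 10) * (1 - 1 / Fraction(eps))

def allowed_frequency_at_y(eps):
    """Largest frequency f with LEVEL - f >= C_y(x), ASSUMING c = 0.3 - 1_S."""
    return LEVEL - residual_constraint(eps)

for eps in (Fraction(1, 2), Fraction(1, 11), Fraction(1, 1000)):
    print(eps, float(residual_constraint(eps)), float(allowed_frequency_at_y(eps)))
```

Once the allowed frequency exceeds 1, the residual constraint at y no longer restricts any action, which is the unreasonable relaxation this example is meant to expose.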
[Figure 3: Yacht or not. Subchain labels recovered from the figure: \(P(S_1) = 0.1\), \(P(S_2) = 0.05\), \(P(S_3) = 0.4\), \(P(S_4) = 0.2\).]



4.2. Sample-Path Constraints 

How then can we resolve Haviv's problem? Haviv advocates the use of sample-path
constraints, where we remove the expectation in the constraint function in (1) and require instead that
the inequality be satisfied with probability one. In our notation, this would correspond to, given
\(X_0 = x\),

\[ \lim_{T \to \infty} W_T^{\pi}(x) \ge 0 \quad P^{\pi}_{x}\text{-a.s.} \]

It is clear that with such a constraint, a policy \(\pi\) is feasible at x if and only if it is feasible at each state
reachable from x.

Note that this modification to the constraint immediately alleviates the problem illustrated in 
Fig. 2. In contrast to the previous form of the constraint, it would no longer be feasible to squander
our money even if we win the lottery. 

Example: Yacht or not

To illustrate this point further, consider the problem in Fig. 3, which is very similar to Fig. 2
but simpler. In the current problem, again we have the (unlikely) event of winning the lottery.
However, regardless of winning, we have the decision of whether or not to buy a yacht. Depending
on whether or not we win the lottery and what decision we make about the yacht, we will enter
one of four subchains, wherein there is some probability of going broke before retiring (as before,
these are shown as \(P(S_i)\), i = 1, 2, 3, 4, in Fig. 3). The stagewise reward values shown in the figure
are again meant to signify our enjoyment of life within these subchains.

As in the problem of Fig. 2, if we impose sample-path constraints, we will quickly arrive at
the conclusion that in state z, we cannot decide to buy a yacht because doing so would violate the 
constraint in chain 3. However, in state y, where we have won the lottery, we can in fact buy a 
yacht; doing so would not violate the constraint. The optimal choice at state y is indeed to buy a 
yacht, leading to maximal enjoyment of life (within this example). 
4.3. Satisfying Haviv

Suppose we make the modification from expected constraint to sample-path constraint in
Haviv's problem. Specifically, we now require that, with probability one, the frequency of visits to
S not exceed 0.125. Then, no policy is feasible, because there is a 0.5 probability that the process
will enter chain 1, where the frequency of visits to S is 0.2. Haviv [6] does point this out, but does
not provide a resolution to it.
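
The infeasibility under a sample-path constraint can be seen by checking every chain that is entered with positive probability. This is a minimal sketch with hypothetical function names; the frequency 0.05 for chain 2 is an assumed placeholder, while 0.2 and 0.1 are the values quoted in the text for chains 1 and 3.

```python
BOUND = 0.125
# Frequency of visits to S per chain; the chain 2 value is an ASSUMED placeholder.
FREQ = {1: 0.2, 2: 0.05, 3: 0.1}

def chains_reached(start, action):
    """Chains entered with positive probability when `action` is used at y."""
    after_y = {"a": 2, "b": 3}[action]
    return {after_y} if start == "y" else {1, after_y}

def sample_path_feasible(start, action):
    """Feasible with probability one iff every reachable chain meets the bound."""
    return all(FREQ[k] <= BOUND for k in chains_reached(start, action))

for start in ("x", "y"):
    print(start, {a: sample_path_feasible(start, a) for a in ("a", "b")})
```

Starting at x, chain 1 is always among the reachable chains, so neither action passes the check regardless of the value assumed for chain 2.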

In other words, using a sample-path constraint does not resolve Haviv's problem, because no
policy would be feasible. Moreover, for some class of constrained MDPs (including Haviv's),
sample-path constraints can be converted to equivalent expected constraints. These are what we
might call trans-policy decomposable MDPs (see [1]). Basically, to convert sample-path
constraints into expected constraints in such MDPs, we impose an expected constraint at each
subchain. For example, for subchain \(C_3\), use the constraint function \(c(X_t, a) \, 1_{\{X_t \in C_3\}}\) in the expected
form of the constraint.

In the problems of Fig. 2 and Fig. 3, for example, we do not need to impose sample-path
constraints; instead, we can impose the usual (expected) form of the constraint in each of the 
four subchains. If we do so, these problems would no longer suffer from Haviv's problem, and 
the optimal policy would be equally optimal at all reachable states without having to change the 
constraint. 

The conversion of sample-path constraints into equivalent expected constraints highlights an 
issue that is, at heart, what gives rise to Haviv's problem: At state y in Fig. 1, we have no control
over whether the process will enter chain 1. Indeed, this tells us that when imposing expected 
constraints in the subchains, we should not impose them at all subchains. In particular, we should 
not impose any constraint at chain 1, because we have no control (at y) over whether or not we 
enter it. The constraints should be imposed only at chains 2 and 3, which depend on a decision 
over which we have control. If we do this, then even Haviv's original problem in Fig. 1 would be
resolved: The optimal policy with respect to initial state x is equally optimal (and feasible) at state 
y, and would select action b as desired. 
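
The effect of constraining only the controllable chains can be sketched on Haviv's example as follows. The names are hypothetical, the chain 2 frequency 0.05 is an assumed value, and the stagewise rewards (0, 10, 20) and the transition probability 0.5 are those described in the text. Each constrained chain is checked against the level 0.125 separately.

```python
BOUND = 0.125
FREQ = {1: 0.2, 2: 0.05, 3: 0.1}      # chain 2 value is ASSUMED
REWARD = {1: 0.0, 2: 10.0, 3: 20.0}   # stagewise reward in each chain
CONSTRAINED = {2, 3}                  # chain 1 is left out: no action controls entry
CHAIN_AFTER = {"a": 2, "b": 3}

def feasible(action):
    """Per-chain expected constraint, imposed only on chains in CONSTRAINED."""
    k = CHAIN_AFTER[action]
    return k not in CONSTRAINED or FREQ[k] <= BOUND

def expected_reward(start, action):
    """Long-run average reward; from x, chain 1 and y are equally likely."""
    at_y = REWARD[CHAIN_AFTER[action]]
    return at_y if start == "y" else 0.5 * REWARD[1] + 0.5 * at_y

def best_action(start):
    candidates = [a for a in CHAIN_AFTER if feasible(a)]
    return max(candidates, key=lambda a: expected_reward(start, a))

print(best_action("x"), best_action("y"))
```

Because feasibility no longer depends on the starting state, the same action is selected whether the problem is posed at x or at y.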

Another way to express this observation is that constraints should be imposed only on the 
consequence of decisions, expressing conditions on the desired (or undesired) impact of decisions 
once they are made. In Haviv's example, expressing the constraint at state x does not properly 
reflect the impact of actions at y, which do not control whether or not the system enters chain 1. 
The same would be true even if we modify the example to include action choices at state x (e.g., we 
can control the probability of entering chain 1). The constraint at state x would still not properly 
reflect the impact of actions at y, which do not control entry into chain 1, giving rise to Haviv's 
lament. 